Initial data exploration

Description: Initial data exploration and description

Describing dataset meta-info: 1) Features 2) Size

I also used this notebook to get a bit familiar with Folium.

The User Guide of T-Drive Data describes the data as follows:

"This dataset contains the GPS trajectories of 10,357 taxis during the period of Feb. 2 to Feb. 8 2008 within Beijing. The total number of points in this dataset is about 15 million and the total distance of the trajectories reaches to 9 million kilometers. The average sampling interval is about 177 seconds with a distance of about 623 meters. Each file of this dataset, which is named by the taxi ID, contains the trajectories of one taxi"
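Since each taxi has its own file, the first step is to read them all into one dataframe. Below is a minimal sketch of how that could look. The `.txt` extension, the column names and the helper name `load_taxi_folder` are my assumptions; the column order (id, timestamp, longitude, latitude) follows the sample rows shown further down.

```python
from pathlib import Path

import pandas as pd

# Column names are assumptions; the order follows the raw rows in the files.
COLUMNS = ["taxi_id", "datetime", "longitude", "latitude"]


def load_taxi_folder(folder):
    """Read every per-taxi file in `folder` into a single DataFrame."""
    frames = []
    for path in sorted(Path(folder).glob("*.txt")):
        if path.stat().st_size == 0:
            # Empty files contribute no rows (and would make read_csv raise).
            continue
        frames.append(
            pd.read_csv(path, header=None, names=COLUMNS, parse_dates=["datetime"])
        )
    if not frames:
        return pd.DataFrame(columns=COLUMNS)
    return pd.concat(frames, ignore_index=True)
```

Skipping zero-byte files up front avoids the `EmptyDataError` that `pd.read_csv` raises on them.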

Data

There are 10357 files in the folder, but only 10337 unique ids. Why are some missing?

Find the taxi ids that have a file in the folder but do not appear in the dataframe:

Going through the files for these ids, I found that all of them were empty. Therefore there are 10336 unique relevant taxis.
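The check above can be sketched with a set difference between the ids taken from the file names and the ids present in the loaded data. The function names and the `<id>.txt` naming pattern are assumptions made for illustration.

```python
from pathlib import Path


def ids_without_rows(folder, ids_in_data):
    """Ids that have a file in `folder` but no rows in the loaded data."""
    file_ids = {int(p.stem) for p in Path(folder).glob("*.txt")}
    return sorted(file_ids - {int(i) for i in ids_in_data})


def all_files_empty(folder, taxi_ids):
    """True if every listed taxi's file is zero bytes long."""
    return all(Path(folder, f"{i}.txt").stat().st_size == 0 for i in taxi_ids)
```

With a dataframe `df` at hand, this would be called as `ids_without_rows(folder, df["taxi_id"].unique())`.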

There are 17 million measurements in the dataset. On average there are 1708 measurements per taxi, with a standard deviation of 4733. This is a large deviation relative to the mean.

We can see that most of the taxis have fewer than 20k measurements. Out of 10336 taxis with measurements, 10298 have fewer than 20000 measurements and 38 have more than 20k.

In fact, most of the taxis have fewer than 2500 measurements.

4 taxis have more than 100k measurements. The taxi with the most measurements has id 6275.
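The per-taxi figures above come down to counting rows per id. A minimal sketch, assuming a dataframe with a `taxi_id` column (the column name and the helper `measurement_stats` are mine); the toy data at the bottom only shows the shape of the result and is not the real dataset.

```python
import pandas as pd


def measurement_stats(df, thresholds=(2_500, 20_000, 100_000)):
    """Summarise how many measurements each taxi has."""
    counts = df.groupby("taxi_id").size()
    summary = {
        "taxis": int(counts.size),
        "mean": float(counts.mean()),
        "std": float(counts.std()),  # pandas default: sample std (ddof=1)
        "busiest_taxi": counts.idxmax(),
    }
    for t in thresholds:
        # Number of taxis with strictly fewer than `t` measurements.
        summary[f"below_{t}"] = int((counts < t).sum())
    return summary


# Toy data: taxi 1 has 3 rows, taxi 2 has 1 row, taxi 3 has 2 rows.
toy = pd.DataFrame({"taxi_id": [1, 1, 1, 2, 3, 3]})
stats = measurement_stats(toy, thresholds=(2, 3))
```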

Looking at taxi 6275, it seems to be moving non-stop. It also has a very high sampling frequency, with a measurement almost every second, and many of its longitudes and latitudes are repeated.
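One way to quantify both observations for a single taxi is to look at the time gap between consecutive measurements and at how often the position repeats from one row to the next. This is a sketch under the assumption of `taxi_id`, `datetime`, `longitude` and `latitude` columns; `sampling_profile` is a hypothetical helper and the toy rows are made up.

```python
import pandas as pd


def sampling_profile(df, taxi_id):
    """Median gap in seconds and share of consecutive rows with an unchanged position."""
    one = df[df["taxi_id"] == taxi_id].sort_values("datetime")
    gaps = one["datetime"].diff().dt.total_seconds().dropna()
    pos = one[["longitude", "latitude"]]
    # A row "moved" if either coordinate differs from the previous row.
    moved = pos.ne(pos.shift()).any(axis=1)
    # Skip the first row, which has no predecessor to compare against.
    repeated_share = float((~moved).iloc[1:].mean())
    return float(gaps.median()), repeated_share


# Made-up rows for illustration only.
toy = pd.DataFrame(
    {
        "taxi_id": [6275] * 4,
        "datetime": pd.to_datetime(
            [
                "2008-02-03 00:00:00",
                "2008-02-03 00:00:01",
                "2008-02-03 00:00:02",
                "2008-02-03 00:00:04",
            ]
        ),
        "longitude": [116.40, 116.40, 116.41, 116.41],
        "latitude": [39.90, 39.90, 39.90, 39.90],
    }
)
median_gap, repeated_share = sampling_profile(toy, 6275)
```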

Folium test

Get the data for 3 February
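Selecting a single day can be done by comparing the date part of the timestamp. A small sketch, assuming the timestamps are already parsed into a `datetime` column; `rows_on_day` is a hypothetical helper and the toy rows are made up.

```python
import pandas as pd


def rows_on_day(df, day):
    """Return the rows whose timestamp falls on the given calendar day."""
    day = pd.Timestamp(day).normalize()
    return df[df["datetime"].dt.normalize() == day]


# Made-up timestamps around the day boundary.
toy = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            [
                "2008-02-02 23:59:59",
                "2008-02-03 00:00:00",
                "2008-02-03 18:05:55",
                "2008-02-04 00:00:00",
            ]
        )
    }
)
feb3 = rows_on_day(toy, "2008-02-03")
```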

Duplicate measurements:

5020,2008-02-03 18:05:55,116.15852,39.87435
5020,2008-02-03 18:05:55,116.15852,39.87435
5020,2008-02-03 18:05:55,116.15852,39.87435
5020,2008-02-03 18:05:55,116.15852,39.87435
5020,2008-02-03 18:05:55,116.15852,39.87435
5020,2008-02-03 18:05:55,116.15852,39.87435
5020,2008-02-03 18:05:55,116.15852,39.87435
5020,2008-02-03 18:05:55,116.15852,39.87435
5020,2008-02-03 18:05:55,116.15852,39.87435
5020,2008-02-03 18:05:55,116.15852,39.87435
5020,2008-02-03 18:05:55,116.15852,39.87435
5020,2008-02-03 18:05:55,116.15852,39.87435
5020,2008-02-03 18:05:55,116.15852,39.87435
5020,2008-02-03 18:05:55,116.15852,39.87435
5020,2008-02-03 18:05:55,116.15852,39.87435
5020,2008-02-03 18:05:55,116.15852,39.87435
5020,2008-02-03 18:05:55,116.15852,39.87435
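Rows like these can be flagged and dropped with pandas. The sketch below uses the repeated row from above plus one made-up distinct row (the 18:10:55 one is hypothetical), and the column names are my assumptions.

```python
import io

import pandas as pd

COLUMNS = ["taxi_id", "datetime", "longitude", "latitude"]

# 17 copies of the row shown above, plus one hypothetical distinct row.
raw = (
    "5020,2008-02-03 18:05:55,116.15852,39.87435\n" * 17
    + "5020,2008-02-03 18:10:55,116.16000,39.88000\n"
)
df = pd.read_csv(io.StringIO(raw), header=None, names=COLUMNS, parse_dates=["datetime"])

# keep=False marks every member of a duplicate group, not just the repeats.
in_duplicate_group = df.duplicated(keep=False)
n_flagged = int(in_duplicate_group.sum())

# drop_duplicates keeps the first occurrence of each identical row.
deduped = df.drop_duplicates().reset_index(drop=True)
```

Whether exact duplicates should be dropped depends on the analysis; for counting measurements per taxi they inflate the numbers.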

Visualising latitudes and longitudes of a couple of taxis on Tuesday

Heatmaps and points

Testing out Folium's features.
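A rough sketch of how a heatmap with point markers could be put together in Folium. The map centre, zoom level, radius values and function names are assumptions, not the settings used in this notebook. Note that the raw files store longitude before latitude, while Folium expects `[lat, lon]`.

```python
import pandas as pd

# Approximate centre of Beijing as [lat, lon]; an assumption for the initial view.
BEIJING_CENTER = [39.9042, 116.4074]


def heat_points(df):
    """Return [[lat, lon], ...], the coordinate order Folium expects."""
    return df[["latitude", "longitude"]].dropna().values.tolist()


def build_map(df, max_markers=500):
    """Build a Folium map with a heatmap layer and a capped number of point markers."""
    # Imported here so the data preparation above works without folium installed.
    import folium
    from folium.plugins import HeatMap

    points = heat_points(df)
    m = folium.Map(location=BEIJING_CENTER, zoom_start=11)
    HeatMap(points, radius=8).add_to(m)
    # Individual markers get slow with many points, so only draw the first few.
    for lat, lon in points[:max_markers]:
        folium.CircleMarker(location=[lat, lon], radius=2).add_to(m)
    return m


toy = pd.DataFrame({"longitude": [116.15852], "latitude": [39.87435]})
points = heat_points(toy)
```

In a notebook the returned map renders inline; outside one, `build_map(df).save("taxis.html")` writes it to an HTML file (the file name is just an example).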

Conclusions

In total there are 17 million measurements.

Some taxis had no measurements logged at all. Most of the taxis have fewer than 2500 measurements, but a few have over 100k. The sampling frequency varies a lot.

There are replicated measurements, some with exactly the same timestamp. This is most likely due to connectivity problems on the devices, which send the same data multiple times when they have not received confirmation that the previously sent data arrived.

We can see that there are GPS locations far outside Beijing, some not even on the same continent. Most of the data is indeed in Beijing, and mostly within the 5th Ring Road.
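A simple way to separate the far-away points is a bounding box around the city. The box limits below are rough values I picked for illustration, not something taken from the dataset or its user guide; `split_by_box` and the toy coordinates are likewise made up.

```python
import pandas as pd

# Rough bounding box around Beijing; the limits are assumptions.
BEIJING_BOX = {"lon_min": 115.4, "lon_max": 117.5, "lat_min": 39.4, "lat_max": 41.1}


def split_by_box(df, box=BEIJING_BOX):
    """Split rows into those inside and those outside the bounding box."""
    inside = df["longitude"].between(box["lon_min"], box["lon_max"]) & df[
        "latitude"
    ].between(box["lat_min"], box["lat_max"])
    return df[inside], df[~inside]


# Made-up points: one in Beijing, one at (0, 0), one on another continent.
toy = pd.DataFrame(
    {
        "longitude": [116.15852, 0.0, -73.9],
        "latitude": [39.87435, 0.0, 40.7],
    }
)
inside, outside = split_by_box(toy)
```

The third toy point has a plausible latitude but the wrong longitude, which is why both coordinates are checked.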